I chose to do my final project on the Titanic dataset. I chose this particular dataset because of its popularity among young data scientists. It is one of the easiest datasets to begin with when learning to build a predictive model. I wanted to get familiar with it so that I can build one myself in the near future, and I also thought it would be a fun topic for the project.
https://www.kaggle.com/c/titanic & https://raw.githubusercontent.com/rashida048/Datasets/master/titanic_data.csv
Kaggle runs an ongoing competition to see who can build the most accurate model.
My question is: Can you predict the likelihood of surviving the Titanic disaster? Which characteristics of a person increase or decrease the chances of survival?
The Titanic dataset contains information about each passenger of the Titanic and their fate.
PassengerId: Unique ID for each passenger (dbl)
Survived: Indicates if passenger survived (dbl): 0 = died, 1 = survived
Pclass: Ticket class (dbl): 1 = 1st, 2 = 2nd, 3 = 3rd
Name: Name of passenger (chr)
Sex: Sex of passenger (chr)
Age: Age of passenger (dbl)
SibSp: Number of siblings/spouses (dbl): Sibling = Brother, Sister, Stepbrother, Stepsister; Spouse = Husband, Wife (Mistresses and Fiances ignored)
Parch: Number of parents/children (dbl): Parent = Mother, Father; Child = Daughter, Son, Stepdaughter, Stepson (Children with nanny = 0)
Ticket: Ticket number of passenger (chr)
Fare: Price of passenger ticket (dbl)
Cabin: Cabin number of passenger (chr)
Embarked: Port of embarkation (chr): Q = Queenstown, S = Southampton, C = Cherbourg
## # A tibble: 6 × 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 1 0 3 Braund… male 22 1 0 A/5 2… 7.25 <NA>
## 2 2 1 1 Cuming… fema… 38 1 0 PC 17… 71.3 C85
## 3 3 1 3 Heikki… fema… 26 0 0 STON/… 7.92 <NA>
## 4 4 1 1 Futrel… fema… 35 1 0 113803 53.1 C123
## 5 5 0 3 Allen,… male 35 0 0 373450 8.05 <NA>
## 6 6 0 3 Moran,… male NA 0 0 330877 8.46 <NA>
## # … with 1 more variable: Embarked <chr>
## # A tibble: 6 × 5
## Survived Pclass Sex Embarked cat
## <chr> <chr> <chr> <chr> <chr>
## 1 Died Third Male Southampton Adult
## 2 Survived First Female Cherbourg Adult
## 3 Survived Third Female Southampton Adult
## 4 Survived First Female Southampton Adult
## 5 Died Third Male Southampton Adult
## 6 Died Third Male Queenstown <NA>
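The recoded table above can be produced from the raw columns with a few lookups. The following is only a minimal sketch in base R on made-up rows, not the code used for this report. The age breaks are an assumption: the only boundary stated in this report is that Adults are ages 20-40, and the other cut points and the helper names (`raw`, `recoded`, `age_breaks`, `age_labels`) are hypothetical.

```r
# Toy rows mimicking the raw columns; the values are made up for illustration.
raw <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 3, 2),
  Sex      = c("male", "female", "female", "male"),
  Age      = c(22, 38, NA, 4),
  Embarked = c("S", "C", "Q", "S"),
  stringsAsFactors = FALSE
)

# ASSUMED age breaks: only "Adult = 20-40" comes from the report.
age_breaks <- c(0, 2, 5, 13, 20, 40, 60, Inf)
age_labels <- c("Baby", "Toddler", "Child", "Teen", "Adult", "MAA", "Senior")

# Named vector used as a lookup table for the port codes.
ports <- c(S = "Southampton", C = "Cherbourg", Q = "Queenstown")

recoded <- data.frame(
  Survived = ifelse(raw$Survived == 1, "Survived", "Died"),
  Pclass   = c("First", "Second", "Third")[raw$Pclass],
  Sex      = ifelse(raw$Sex == "male", "Male", "Female"),
  Embarked = unname(ports[raw$Embarked]),
  # right = FALSE makes each bin include its lower bound, e.g. [20, 40).
  cat      = as.character(cut(raw$Age, breaks = age_breaks,
                              labels = age_labels, right = FALSE)),
  stringsAsFactors = FALSE
)

print(recoded)
```

A passenger with a missing age keeps `NA` as the category, which matches the sixth row of the table above.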
Class is categorical and is split into 3 categories (1, 2, 3). Fare is numeric. There is a moderate negative correlation (-0.55) between the two variables. This indicates that as the price of the fare increases, the class number decreases. The assumption that higher fares are associated with 1st class (1) can already be made, but it is nice to know that the data supports it as well.
Age and Class have a low negative correlation (-0.37). This indicates that as age increases, the class number decreases (closer to 1st class). One plausible explanation is that older passengers tended to have more money than younger passengers.
Class has a low negative correlation (-0.36) with survival. Class gets worse (economically) as the number increases (1st class is 1, 2nd is 2, and 3rd is 3), so wealthier people are more likely to be in first class. ‘Survived’ records whether the passenger died (0) or survived (1), so a higher value means survival. The correlation therefore indicates that as survival increases, the class number decreases. In other words, wealthier people (people in 1st class) were more likely to survive.
Sex has a moderate positive correlation (0.54) with survival. As the sex code increases (male = 1 to female = 2), so does the survival code (coded here as died = 1, survived = 2). This indicates that women were more likely to survive.
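The correlations discussed above can be computed with `cor()`. This is a minimal sketch on made-up rows, assuming Pearson correlation (the default for `cor()`); the report does not state which method was used, and the data frame `toy` and the column `SexNum` are hypothetical names.

```r
# Toy data; the values are made up and will NOT reproduce the figures above.
toy <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0, 1, 0),
  Pclass   = c(3, 1, 3, 1, 3, 3, 2, 2),
  Sex      = c("male", "female", "female", "female",
               "male", "male", "female", "male"),
  Fare     = c(7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 30.07, 13.00),
  stringsAsFactors = FALSE
)

# Sex must be numeric before it can be correlated: male = 1, female = 2.
toy$SexNum <- ifelse(toy$Sex == "male", 1, 2)

cors <- c(
  class_fare     = cor(toy$Pclass, toy$Fare),
  class_survived = cor(toy$Pclass, toy$Survived),
  sex_survived   = cor(toy$SexNum, toy$Survived)
)
print(round(cors, 2))

# Shifting a code (0/1 versus 1/2) does not change the correlation.
shifted <- cor(toy$SexNum, toy$Survived + 1)
print(isTRUE(all.equal(shifted, cors[["sex_survived"]])))
```

Because correlation is unaffected by adding a constant, it does not matter whether survival is coded 0/1 or 1/2; only the direction of the coding matters for the sign.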
This plot shows that most passengers on board were Adults (ages 20-40). Middle-Aged Adults are the second most common age category. There appears to be an upward trend until Adults, and a downward trend after Adults. There are also more male passengers than female passengers in just about every age category.
This map shows the 3 ports where passengers boarded the Titanic. Southampton is the port with the largest number of passengers. A little over half of the passengers from Southampton were third class, and the rest were split fairly evenly. The second largest port is Cherbourg, which had over half of its passengers in first class. The smallest port is Queenstown, which had 77 passengers, 72 of them in third class.
The plot shows the number of passengers, as well as the fare price, for each class. Third class is the largest, first class is in the middle, and second class is the smallest. As expected, the price goes up as you get closer to first class.
This plot shows the survival rate of passengers by their age category and gender. It can be observed that females survived at a significantly higher rate than males. For women, the survival rate for babies and seniors appears to be 100%, and all other categories seem to be similar to one another. For men, the chances of survival decrease as age increases.
It can be observed that Southampton had the highest fatality rate: 66.3% of its passengers died. Queenstown was a close second at 61%. Cherbourg was the only port below 50%, at 44.6%.
This graph shows the death rate of individuals based on their port and class. The common theme is that passengers were more likely to die the closer they were to third class. Southampton appeared to have a higher death rate in all categories. One possible explanation is that passengers from Southampton were placed in a similar (unlucky) part of the ship.
This graph shows the survival rate of individuals based on whether or not they had family on board. In general, 51.6% of people with family on board survived, while 33.2% of people traveling alone survived. Men who were alone had a 16.8% survival rate, while men with family had a 28.2% survival rate. Women traveling alone had a 79% survival rate, and women with family had a 73.3% survival rate. This indicates that men traveling alone were the least likely to survive.
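Survival rates like these are group means of the 0/1 `Survived` column. The sketch below uses made-up rows and assumes that "family on board" means `SibSp + Parch > 0`; the report does not spell out that definition, and the names `toy`, `Family`, and `rates` are hypothetical.

```r
# Toy data; values are made up and will NOT reproduce the percentages above.
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 0, 1, 1, 0),
  Sex      = c("male", "male", "female", "female",
               "male", "male", "female", "male"),
  SibSp    = c(0, 1, 0, 0, 0, 0, 1, 0),
  Parch    = c(0, 0, 0, 2, 0, 1, 0, 0),
  stringsAsFactors = FALSE
)

# ASSUMPTION: a passenger has family on board if SibSp + Parch > 0.
toy$Family <- ifelse(toy$SibSp + toy$Parch > 0, "With family", "Alone")

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate.
rates <- aggregate(Survived ~ Family + Sex, data = toy, FUN = mean)
names(rates)[names(rates) == "Survived"] <- "SurvivalRate"

print(rates)
```

Grouping by `Family` alone (dropping `Sex` from the formula) gives the overall rates for passengers with and without family.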
This graph shows the groups of people by their Age, Sex, and Class, which have been identified as the most important variables. Once again, it shows that men have lower survival rates than women. You can also observe that groups with higher survival rates tend to be closer to first class.
You can see that most people in the ‘All Survived’ category are young and upper-class. ‘Mostly Survived’ shows mostly upper-class women. ‘Mostly Died’ shows mostly lower-class men. ‘All Died’ contains only a few small groups, which appear to be outliers.
## # A tibble: 13 × 5
## # Groups: cat, Sex, Pclass [13]
## cat Sex Pclass group1 CSCT
## <chr> <chr> <chr> <chr> <int>
## 1 Senior Female First All Survived 3
## 2 Teen Female First All Survived 13
## 3 Child Female Second All Survived 4
## 4 Teen Female Second All Survived 8
## 5 Toddler Female Second All Survived 4
## 6 Baby Female Third All Survived 4
## 7 Senior Female Third All Survived 1
## 8 Baby Male First All Survived 1
## 9 Child Male First All Survived 1
## 10 Toddler Male First All Survived 1
## 11 Baby Male Second All Survived 5
## 12 Child Male Second All Survived 1
## 13 Toddler Male Second All Survived 3
## # A tibble: 8 × 5
## # Groups: cat, Sex, Pclass [8]
## cat Sex Pclass group1 CSCT
## <chr> <chr> <chr> <chr> <int>
## 1 Adult Female First Mostly Survived 43
## 2 MAA Female First Mostly Survived 25
## 3 Adult Female Second Mostly Survived 42
## 4 MAA Female Second Mostly Survived 16
## 5 Teen Female Third Mostly Survived 22
## 6 Toddler Female Third Mostly Survived 8
## 7 Adult Male First Mostly Survived 41
## 8 Baby Male Third Mostly Survived 4
## # A tibble: 14 × 5
## # Groups: cat, Sex, Pclass [14]
## cat Sex Pclass group1 CSCT
## <chr> <chr> <chr> <chr> <int>
## 1 Adult Female Third Mostly Died 47
## 2 Child Female Third Mostly Died 11
## 3 MAA Male First Mostly Died 39
## 4 Senior Male First Mostly Died 14
## 5 Teen Male First Mostly Died 4
## 6 Adult Male Second Mostly Died 59
## 7 MAA Male Second Mostly Died 17
## 8 Senior Male Second Mostly Died 4
## 9 Teen Male Second Mostly Died 10
## 10 Adult Male Third Mostly Died 155
## 11 Child Male Third Mostly Died 12
## 12 MAA Male Third Mostly Died 31
## 13 Teen Male Third Mostly Died 38
## 14 Toddler Male Third Mostly Died 9
## # A tibble: 3 × 5
## # Groups: cat, Sex, Pclass [3]
## cat Sex Pclass group1 CSCT
## <chr> <chr> <chr> <chr> <int>
## 1 Toddler Female First All Died 1
## 2 MAA Female Third All Died 9
## 3 Senior Male Third All Died 4
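The four tables above sort each Age/Sex/Class group into a survival bucket. One way this could be done is sketched below on made-up rows. The thresholds are assumptions (100% = All Survived, at least 50% = Mostly Survived, above 0% = Mostly Died, 0% = All Died), and `CSCT` is assumed to be the number of passengers in the group; neither is confirmed by the report. `label_group` is a hypothetical helper.

```r
# Toy data; values are made up for illustration only.
toy <- data.frame(
  cat      = c("Teen", "Teen", "Adult", "Adult", "Adult",
               "Adult", "Adult", "Adult", "Senior", "Senior"),
  Sex      = c("Female", "Female", "Female", "Female", "Female",
               "Male", "Male", "Male", "Male", "Male"),
  Pclass   = c("First", "First", "First", "First", "First",
               "Third", "Third", "Third", "Third", "Third"),
  Survived = c(1, 1, 1, 1, 0, 0, 0, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Survival rate and group size for each Age/Sex/Class combination.
rate <- aggregate(Survived ~ cat + Sex + Pclass, data = toy, FUN = mean)
size <- aggregate(Survived ~ cat + Sex + Pclass, data = toy, FUN = length)
groups <- merge(rate, size, by = c("cat", "Sex", "Pclass"),
                suffixes = c("_rate", "_n"))

# ASSUMED thresholds for the four buckets.
label_group <- function(r) {
  ifelse(r == 1, "All Survived",
  ifelse(r >= 0.5, "Mostly Survived",
  ifelse(r > 0, "Mostly Died", "All Died")))
}
groups$group1 <- label_group(groups$Survived_rate)

print(groups)
```

Looking at the group size next to the bucket helps flag cases like ‘All Died’, where a group of only a handful of passengers can land in an extreme bucket by chance.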
In conclusion, Sex appears to be the most important variable when it comes to surviving. In almost every category (Class & Age), women survived at a higher rate than men. Second is Age. For men, the relationship is almost perfectly linear: as the age of men increases, survival rates decrease. For women, age is not as important. Outside of baby girls and elderly women, age does not play a large role. Class is also correlated with survival: the closer to first class, the better the chances of survival. Port of Embarkation also appears to be significant, but this is likely due to differences in class, age, and sex between the ports. However, if Port of Embarkation affected where a passenger was placed on the ship, it could be a very important variable. Lastly, it is important to note that having family on the ship mattered. I assume that it would be difficult to get onto a lifeboat with nobody there to encourage you to take a spot.